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VALIDATION  OF  A  MONTE  CARLO  SIMULATION 
OF  BINARY  TIME  SERIES 


INTRODUCTION 

In  this  report  a  statistical  technique  is  presented  appropriate  for  the  validation  of  a  Monte 
Carlo  simulation  of  processes  such  as  detection  processes,  whose  results  can  be  reduced  to  a  finite 
sequence  of  thresholding  events  having  dichotomous,  i.e.,  binary,  outcomes.  Moreover,  the  valida¬ 
tion  technique  can  be  applied  even  when  the  process  being  simulated  cannot  be  experimentally 
repeated.  Thus  the  simulation’s  validity  for  the  case  examined  can  be  determined,  to  within  a 
specified  level  of  statistical  significance,  with  only  a  single  observation  of  the  real-world  process 
being  simulated.  The  statistical  technique  is  nonparametric,  it  does  not  assume  independence 
between  events  occurring  at  different  times,  and  it  does  not  require  the  assumption  of  any  station- 
arity  or  steady  state  behavior  of  the  process  simulated. 

The  validation  technique  can  be  briefly  described  as  follows.  Each  Monte  Carlo  replication  of 
the  simulation  model  produces  a  vector  of  m  binary  elements.  Based  on  this  sample  of  binary  vec¬ 
tors,  a  representation  is  obtained  of  the  probability  distribution  of  the  population  of  binary  vectors 
from  which  the  sample  was  drawn.  Using  this  representation  the  likelihood  of  occurrence  of  any 
vector rof  m  binary  elements  may  be  computed  under  the  hypothesis  that  it  comes  from  the  same 
statistical  population  as  the  vectors  generated  by  the  simulation  model.  In  particular  the  likelihood 
of  the  binary  vector  resulting  from  an  observed  run  of  the  actual  process  under  the  same  conditions 
that  are  represented  in  the  simulation  is  computed.  The  question  of  the  validity  of  the  simulation 
model,  at  tb  ?  significance  level  a,  is  then  resolved  by  observing  whether  the  probability  of  the 
experimentally  obtained  vector  exceeds  the  ath  percentile  of  the  probabilities  of  the  simulation¬ 
generated  vectors. 

The  statistical  technique  described  in  this  report  constitutes  a  new  application,  to  simulation 
model  validation,  of  previous  results  concerning  the  representation  of  the  probability  distribution 
of  dichotomous  experimental  responses.  Included  in  this  report  is  a  description  of  how  this  tech¬ 
nique  was  applied  to  the  statistical  validation  of  a  particular  simulation  model  used  by  the  U.S.  Navy 
to  represent  the  surveillance  performance  of  a  system  of  undersea  acoustic  sensors. 


VALIDATION  DEFINED 

There  is  considerable  diversity  of  opinion  on  what  constitutes  a  validation  of  a  simulation,  and, 
for  that  matter,  on  how  the  term  validation  is  defined.  This  report  follows  the  currently  accepted 
terminology,  as  in  Fishman  and  Kiviat  [1]  and  Steinhorst  and  Garratt  [2] ,  and  defines  validation  as 
“testing  the  agreement  between  the  behavior  of  the  simulation  model  and  the  real  system,”  as 
distinguished  from  verification,  which  is  taken  to  mean  “insuring  that  a  simulation  model  behaves 
as  the  experimenter  intends.” 

A  multitude  of  different  criteria  and  procedures  have  been  proposed  for  the  validation  of 
simulation  models.  Some  of  these  criteria  are  qualitative,  such  as  the  suggestion  by  Turing  [3]  that 
a  model  is  valid  if,  given  both  a  model’s  synthetic  output  and  nonmodeling  results  in  a  similar 
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format,  an  expert  cannot  discern  which  is  the  model  result  —  or  the  widely  used  standby  of  visual 
comparison  of  model  output  with  empirical  data. 

This  report  presents  the  view  that  a  quantitative  procedure  for  simulation  model  validation  is 
preferable.  A  quantitative  procedure  is  a  statistical  technique  whose  outcome  is  the  determination 
whether  the  simulation  model  forecast  does  or  does  not  agree  with  nonmodeling  results,  with  a 
specified  level  of  statistical  assurance.  A  great  variety  of  such  quantitative  procedures  have  been 
proposed  for  the  validation  of  simulations.  In  the  end,  the  choice  of  a  statistical  technique  of  vali¬ 
dation  must  be  governed  by  the  structure  of  the  model  output  data  and  the  available  real-world 
observations  of  the  system  being  modeled. 


STRUCTURE  OF  THE  VALIDATION  DATA 

The  simulation  model  in  question  was  designed  to  estimate  the  performance  of  an  acoustic 
sensor  in  the  detection  of  a  target  in  the  ocean.  A  principal  output  of  the  simulation  is  the  set  of 
instantaneous  probabilities  of  target  detection  by  the  sensor  at  each  hour  throughout  the  duration 
of  the  target’s  specified  maneuvers.  The  simulation  requires  a  substantial  collection  of  numerical 
data  as  input  to  the  computer  programs  that  make  up  the  model.  The  U.S.  Navy  maintains  a  data 
base  for  use  in  modeling  sensor  performance,  and  this  data  base  is  periodically  reviewed  and  ap¬ 
proved.  The  model  cannot  be  operated  without  input  data.  Therefore  it  is  appropriate  to  treat  the 
data  base  as  fixed  and  to  consider  the  combination  of  computer  programs  and  data  base  as  the 
simulation  model  to  be  validated. 

The  target’s  acoustic  characteristics  and  the  target  track,  i.e.,  the  history  of  the  geographical 
positions  of  the  target  at  each  hour,  are  provided  as  input  to  the  simulation  model.  The  model 
operates  by  replicating  this  track  many  times.  During  a  particular  replication,  at  each  hour,  the 
model  decides  that  the  sensor  either  is  detecting  or  is  not  detecting  the  target,  based  on  factors  such 
as  sensor-target  range  and  geometry;  the  acoustic  properties  of  the  ocean  in  the  sensor-target  vicin¬ 
ity;  sensor  alertment  due  to  possible  detections  at  previous  hours  of  this  replication;  and  the  magni¬ 
tude  of  a  Gauss- Markov  fluctuation  approximated  by  summing  three  independent  Ehrenfest  random 
walk  terms  (Feller  [4] )  at  each  hour.  Thus  for  a  target  track  lasting  m  hours,  a  single  replication  by 
the  model  produces  a  vector  of  m  elements,  each  element  being  either  1  or  0,  where  1  indicates  the 
sensor  is  detecting  and  0  not  detecting  the  target  at  a  particular  hour.  The  instantaneous  probability 
of  detection  at  a  given  hour  is  then  approximated  by  the  mean  over  all  replications  of  the  vector 
element  corresponding  to  that  hour.  The  structure  of  these  data  is  depicted  in  Fig.  1.  The  detection 
events  occur  at  equally  spaced  intervals  of  one  hour;  however,  the  word  “hour”  could  be  replaced 
by  the  term  “time  step”  wherever  it  occurs  without  affecting  the  accuracy  of  any  statement.  More¬ 
over  the  events  whose  outcomes  are  represented  by  the  binary  vector  elements  need  not  be  equally 
spaced  over  time. 

The  probability  of  detection  of  a  given  target  by  a  given  sensor  is  a  strong  function  of  the 
target’s  geographical  position.  This  fact,  in  conjunction  with  the  alertment  effect  of  prior  detections, 
makes  it  clear  that  a  high  degree  of  correlation  exists  between  the  detection  events  occurring  at 
consecutive  hours.  Moreover,  it  would  be  most  unusual  for  the  same  probability  of  detection  to 
obtain  at  all  points  of  a  target  track.  Since  a  moving  target  is  changing  its  position  and  aspect  rela¬ 
tive  to  the  sensor  as  time  progresses,  no  steady  state  will  be  reached  in  the  finite  duration  of  a  target 
track. 

Another  factor  affecting  the  selection  of  this  particular  validation  technique  is  the  lack  of  data 
concerning  realizations  of  the  simulated  process.  Multiple  realizations  of  the  same  target  track  are 
practically  impossible  to  obtain  for  the  targets  and  sensors  of  interest.  As  is  true  of  many  processes 
that  are  the  subjects  of  simulation,  the  expense  and  difficulty  involved  generally  preclude  making 


NRL  REPORT  85X7 


Elapsed  Tima  (hours) 

1 

2 

3 

4 

5 

6  .. 

.  239 

240 

Observed  Detection  History 

0 

0 

1 

X 

1 

0  . . 

1 

1 

Model 

Replication 

1 

0 

0 

1 

0 

0 

0 

0 

1 

2 

0 

1 

0 

0 

0 

0  .  . 

0 

1 

3 

0 

0 

1 

1 

0 

0  .. 

0 

1 

50 

0 

0 

1 

0 

0 

0  .. 

0 

0 

Predicted 

Detection  Probability 

0.00 

.20 

.60 

.20 

.00 

.00  .. 

.  .00 

.60 

Fig.  1  —  Structure  of  validation  data  for  an  individual  acoustic  sensor. 
240-h  track,  50  model  replications;  1:  detecting  the  target;  and  0:  not 
detecting  the  target 


repeated  controlled  experiments.  It  is  impossible  to  achieve  the  degree  of  control  necessary  to 
obtain  repeated  real-world  observations  which  would  have  identical  inputs  to  the  model.  Detection 
histories  for  many  different  targets  are  available,  but  since  the  target  tracks  differ  geographically, 
one  cannot  treat  the  detection  histories  of  different  targets  as  members  of  the  same  statistical  popu¬ 
lation.  Hence,  observed  samples  of  size  one  are  the  major  constituent  of  the  available  records  of  the 
detection  process  simulated  by  the  model. 

The  recorded  history  of  detections  of  a  given  target  by  a  given  acoustic  sensor  is  not  considered 
to  be  a  fixed  standard  to  be  matched  by  the  simulation.  The  fluctuations  in  the  physical  processes 
involved  in  the  origin,  transmission,  detection,  and  classification  of  an  acoustic  signal  require  the 
recorded  history  of  gains  and  losses  of  contact  with  the  target  to  be  regarded  as  a  sample  of  size 
one  from  a  random  population  of  unknown  distribution.  Hence,  one  is  testing  the  validity  of 
the  simulation  model  as  a  representation  of  the  statistical  structure  underlying  the  recorded  history 
of  detections.  That  is,  the  hypothesis  being  tested  is  that  the  unknown  statistical  distribution  of 
which  the  observed  history  of  detections  constitutes  a  single  sample  point  is  the  same  as  the  un¬ 
known  statistical  distribution  of  which  the  model’s  replications  constitute  many  sample  points. 


STATISTICAL  TECHNIQUE 

As  a  particular  target  traverses  the  surveillance  zone  it  generates  a  track  history  of  detections 
for  each  sensor  in  the  zone.  For  each  sensor  i  there  is  an  observed  vector  x,-  =  (x(1 , .  .  .,  xim )  of  0’s 
and  l’s  from  a  probability  distribution  p'0  and  the  simulation  generates  n  vectors  x  =  (*j, .  .  .,  xm ) 
of  0’s  and  l’s  from  a  probability  distribution  p,.  By  use  of  the  n  generated  vectors,  an  estimate 
p,  is  obtained  of  pr  The  simulated  data  are  applied  to  p,  to  obtain  the  sample  distribution  as  an 
approximation  to  the  population  distribution.  The  test  consists  of  determining  whether  the  ob¬ 
served  vector  x,  has  p, -value  in  the  upper  1  -  a  region  of  the  sample  distribution.  If  it  does,  the 
hypothesis  that  i,  comes  from  the  distribution  p,  is  accepted.  The  test  is  applied  for  all  i  and 
many  acceptances  that  i,  is  from  p,-  will  confirm  that  p,  is  a  good  approximation  to  p'0,  thereby 
validating  the  simulation.  It  is  to  be  expected  that  due  to  statistical  fluctuation  some  sensors  will 
fail  the  test.  Hence,  a  distinction  must  be  made  between  validation  of  the  simulation  model  in 
general  and  specific  statements  about  the  simulation  of  the  individual  sensors.  Statements  about  the 
latter  are  simply  understood  to  carry  the  uncertainty  inherent  in  the  statistical  test  itself. 
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The  statistical  hypothesis  test  uses  the  representation  by  Bahadur  [5]  and  Lazarsfeld  [6] 
for  the  probability  distribution  underlying  binary  sequences,  which  is  summarized  as  follows. 

Assume  the  target  track  is  m  hours  long.  Let  X  be  the  set  of  all  points  x  =  (x1 ,  x2,  . .  .,  xm ) 
with  each  x,  =  0  or  1,  and  suppose  p(x)  is  a  probability  distribution  on  the  elements  of  X,  that  is, 
p(x)  >  0  for  all  xeX and  2^€Xp(ic)  =  1.  Let  E„(  •)  denote  the  expected  value  of  the  expression  in 
parentheses  when  the  distribution  p  obtains.  Then  let 

< 

vi  =  Ep(xi )  0  <  vi  <1;«  =  1,  2,  .  . .,  m;  (1) 

and  | 

zi  =  “  viWvi  *  C1  “  vi)  i=l,2 . m.  (2) 

Next  define  the  family 

rijmEP(2i'zi >  i<i; 

rUk  m  Eplgi  ‘  sr  *k >  i<j<k;  (3) 

r12 ...*  =£p(Z  1  *Z2  *  •  •  ‘Zm)- 

For  x  =  (*! ,  x2, .  .  xm )  define 


and 


pt  !,(*)«  n£i  «ij  **(1  -  i;,)1"*1  (4) 

fix)  -  1  +  •  2,  •  Zj  +  *i<j<krijk  •  z,  •  zj  •  zk  (5) 

+  -..  +  r12.  ,.m  *Z1  'z2  ■  •  • 


Then  for  each  x  in  X, 


p(x)=p[1](x)f(x) .  (6) 

Thus  j  j  (i)  denotes  the  joint  probability  distribution  of  the  x-’s  under  an  assumption  that  the 
x-’s  are  independently  distributed,  and  f(x)  represents  the  effects  of  correlation. 

In  this  representation  it  is  natural  to  refer  to  the  parameters  r-y  as  second  order  correlations, 
to  the  parameters  as  third  order  correlations,  and  so  forth,  culminating  in  r1 2  m'  the  m-th 
order  correlation.  Tne  distribution  p  then  is  said  to  have  order  s  if  one  correlation  of  order  s  is  non¬ 
zero  and  all  correlations  of  order  greater  than  s  are  equal  to  zero.  If  a  distribution  is  known  to  be  of 
a  certain  order  s,  then  the  representation  (5)  need  only  extend  to  correlations  of  order  s  or  less. 
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The  representation  (5)  and  (6)  was  first  given  in  [6]  and  was  derived  by  induction  on  m. 

The  proof  shown  here  was  given  in  [5]  and  is  due  to  Bahadur. 

PROPOSITION  1.  For  every  jc  =  xm )  in  X 

P(x)  =  P[  i  j  . 

To  establish  proposition  1,  let  V  be  the  vector  space  of  real-valued  functions  g  on  X.  Consider 
V as  an  inner  product  space  with  inner  product  (hg)  and  norm  ||£||  =  (g>g)ll2,  where  the  inner 
product  is  defined  by 


(hg)=Ep  (h  •  g)  =  ^h(x)  ‘g(x) 

^  1  *  v/r  V 


It  follows  from  (1)  and  (2)  that  the  set 


S  j  l;Zj.  *2’  •  •  •>  *m  ;2122’  Zl23>  •  •  •’  2m  -  ’  2i2223>  •••**'•  -»21 


Z2  • • • 2m | 


of  functions  on  X  is  orthonormal,  i.e.,  ||g||  =  1  for  each  g  in  S,  and  (hg)  =  0  for  h  and  g  in  S  with 
hit  g.  Since  there  are  2m  functions  in  S,  since  V  is  2m  dimensional,  and  since  P[  i  ]  >0  for  each  x, 
the  following  proposition  holds: 


PROPOSITION  2.  The  set  S  is  a  basis  in  the  space  of  real-valued  functions  on  X.  This  basis 
is  orthonormal  when  pj  t  j  obtains. 

It  follows,  in  particular,  that  each  function  f  on  X  admits  one  and  only  one  representation 
as  a  linear  combination  of  functions  in  S: 


f=^(f,g)  ■  g- 

geS 


Now  take  f  =  p/pj  x  j .  Then 


(f,g)=  V'j'g'P[i 


xeX 


HI 


=  ^g’P 

xeX 


(7) 


=  Ep(g) 

for  all  g.  Since  £p(l)  =  1  and  E„(zt )  =  0  for  i  =  1, . .  .,  m  by  (1)  and  (2),  it  follows  from  the  pre¬ 
ceding  paragraph  and  (7)  that  (5)  holds,  with  the  coefficients  defined  by  (3).  This  establishes 
PROPOSITION  1. 

In  some  applications  the  nature  of  the  situation  being  studied  or  computational  problems 
might  make  it  necessary  to  assume  a  specified  order  to  the  distribution,  even  though  the  value  of 
that  order  cannot  be  known  precisely.  If  the  selection  is  in  error,  then  the  “truncated”  form  of 
expression  (5)  will  be  in  error  and  so  will  the  resulting  values  of  f(x)  and  p(x).  As  defined  by  (6), 
the  estimated  p(x)  may  not  be  a  probability  distribution  and  may  assume  negative  values  for 
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some  x.  This  point  is  discussed  by  Bahadur  [5] .  In  addition,  to  obtain  the  estimate  p  of  (6)  one 
must  first  obtain  the  estimate  f  of  (5).  The  statistical  fluctuation  associated  with  the  values  f  (x) 
is  another  source  of  error  leading  to  negative  values  of  f(x)  for  some  x  and  consequently  for  p  (x). 
Because  p  may  not  be  a  probability  distribution,  values  taken  by  p  are  referred  to  as  likelihood 
values. 

In  the  present  application  several  considerations  led  to  truncating  the  form  of  (5).  A  fourth- 
order  approximation  to  the  Bahadur-Lazarsfeld  representation  has  been  employed,  truncating  the 
expression  in  (5)  after  the  fourth-order  correlations  and  using  a  time  window  of  12  h,  that  is,  assum¬ 
ing  zero  correlation  between  time  steps  more  than  12  h  apart.  From  observations  of  the  detection 
process  it  seems  reasonable  that  the  correlation  between  instants  separated  by  more  than  12  h  is 
negligible  compared  to  the  correlations  between  instants  closer  together,  and  the  contribution  of 
correlations  of  order  greater  than  four  is  relatively  insignificant.  The  truncation  has  also  been 
necessary  in  order  to  keep  the  computer  costs  within  reason. 

The  need  has  been  established  to  develop  a  methodology  using  experimental  evidence  to 
estimate  the  proper  size  of  the  time  window  and  the  order  of  correlation.  A  Bayesian  statistical 
method  has  been  developed  by  Haskell  [7]  to  be  applied  to  detection  data  in  estimating  sensor 
performance  values  for  various  measures  of  effectiveness.  The  technique  models  the  thresholding 
events  as  a  stochastic  process  in  continuous  time  with  a  well  defined  autocorrelation  function.  The 
autocorrelation  function  is  characterized  by  one  parameter  which  the  technique  estimates.  Apply¬ 
ing  this  method  to  the  real-world  observations  of  specific  sensors  whose  performance  is  being 
simulated  could  provide  a  more  precise  assessment  of  the  correlation  order  and  length  of  the  time 
window  of  the  detection  process. 


APPLICATION  TO  VALIDATION 

The  observed  realization  to  be  compared  with  the  simulation  results  consists  of  a  track  history 
of  duration  m  hours  together  with  the  associated  detection  history.  The  simulation  model  is  pro¬ 
grammed  to  run  with  input  parameters  characterizing  the  observed  situation.  Then  n  replications 
of  the  model  are  run,  each  producing  a  time  series  (vector)  x  g  =  (xgi ,  xg2, .  • .,  xgm ), 
g  =  1,  2, .  .  .,  n,  of  m  binary  elements,  where  1  denotes  a  detection  and  0  no  detection  of  the  target 
by  the  acoustic  sensor.  Actual  values  of  m  and  n  used  are  m  =  240  and  n  =  50.  The  n  binary  vectors 
are  used  to  estimate  the  parameters  in  the  Bahadur-Lazarsfeld  representation  of  the  probability 
distribution  corresponding  to  the  population  from  which  the  n  sample  vectors  are  generated.  Simple 
unbiased  estimators  were  chosen  for  all  parameters.  The  estimators  for  the  u,’s  are  maximum  likeli¬ 
hood  when  the  distribution  j  obtains,  that  is,  when  the  xgi's  are  independently  distributed. 
Since  the  w,-*s  are  assumed  to  be  neither  0  nor  1,  a  reasonable  correction  is  made  should  the  data 
seem  to  indicate  they  are.  The  estimates  are  obtained  by: 


1/2  * 

if 

1  -  d/2 n) 

“  P“'n 

(8) 

otherwise  for  i  =  1,  2,  . . m; 
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and 


N 

II 

*F 

i 

w 

s 

• 

i 

.5? 

i  =  1,  2, .  .  .,  m, 

*=1.2 . n; 

(9) 

1  <  i  <y<  m; 

(10) 

7ijk  =d/")  zgk 

l<i<j<k  <m; 

ri2...m  =(l/«)^gX  -zg2'..--gm. 


The  likelihood  p .  of  the  g-th  model  replication  x  is  given  by 


Pg  =P(xg)=f(xg) 


■  m  *  1-x  1 


g=  1,  2 . n. 


(11) 


with  f( x  )  as  characterized  by  (5)  and  the  z-values  as  given  by  (9).  Then  under  the  hypothesis  that 
the  random  mechanism  for  the  simulation  is  a  model  for  the  random  mechanism  underlying  the 
observed  sensor’s  detections,  the  recorded  sequence  of  detections  of  that  target  by  the  specified 
sensor,  tne  binary  vector  x  =  (Xj ,  *2, . .  .,  *m ),  has  likelihood  q  =  p(x)  —  relative  to  the  Bahadur- 
Lazarsfeld  representation  of  the  n  model  replications  given  by  (6),  using  the  v’s  and  r ’s 
computed  by  (8)  and  (10)  from  the  model  replications.  Once  the  numbers  q  =  p(x)  and 
/ p  =  p(x  )  :  g  -  1,  2, . .  .,  n }  have  been  obtained,  it  can  be  determined  whether  to  accept  the 
simulation* model  as  valid  at  a  significance  level  a.  The  test  procedure  (a  straightforward  rank  test) 
is  to  reject  the  hypothesis  of  association  if  the  observed  value  q  falls  below  the  a-th  percentile  of 
computed  values  p„.  Define  N  to  be  the  number  of  elements  in  the  set  {pg  .  p«  ^  O'. 1  ^ g  ^  n} • 
Then  if  N  >  not,  the  simulation  model  is  determined  to  be  valid  in  predicting  tne  performance  of 
the  specified  acoustic  sensor  in  detecting  that  target.  The  model  is  rejected  as  not  valid  at  the  a 
significance  level  if  N  <  not.  Because  the  range  of  likelihood  values  spans  several  orders  of  magni¬ 
tude,  plots  of  the  cumulative  distribution  are  generally  done  in  the  form  “log  likelihood  vs  cumu¬ 
lative  probability.”  The  value  of  N  is  obtained  by  first  arranging  the  sequence  log  (p?),  g  =  1,  .  .  .,  n, 
in  ascending  order,  comparing  each  member  with  log  (g),  and  updating  N  until  the  first  value  is 
found  that  exceeds  log  ( q ).  Suppose,  for  example,  a  -  0.10  is  specified.  Then  for  n  -  50,  the  cri¬ 
terion  says  that  for  the  simulation  to  be  accepted  as  a  reasonable  model,  the  number  N  (the  number 
of  replications  whose  likelihood  is  no  greater  than  q)  must  be  at  least  5.  The  reason  a  one-sided  test 
is  required  here  is  that  the  higher  the  likelihood  of  the  recorded  detection  history  relative  to  the 
Bahadur-Lazarsfeld  representation  of  the  model  replications,  the  better  the  agreement  is  between 
the  recorded  detection  history  and  the  simulation  output.  Hence  one  need  only  be  concerned  with 
rejecting  the  simulation  model  if  the  likelihood  of  the  recorded  detection  sequence  is  low  relative 
to  the  likelihoods  of  the  model  replications. 


As  mentioned  before,  the  approximation  to  (5)  can  be  negative  for  some  vectors  x,in  which 
case  the  hypothesis  test  cannot  always  be  decided.  Fortunately  such  occurrences  turned  out  to  oe 
very  rare.  An  alternate  statistical  technique  was  developed  to  deal  with  the  very  few  instances  in 
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which  it  happened.  An  additional  alternative  to  deal  with  this  problem  would  have  been  to  add 
correlation  terms  of  higher  order  until  the  Bahadur-Lazarsfeld  representation  yielded  a  positive 
likelihood  (if  at  all).  This,  however,  is  not  guaranteed  to  work  since  all  parameters  are  estimated. 

It  was  deemed  too  expensive  to  try  for  the  very  little  reward  to  be  obtained  in  incorporating  just 
a  few  more  cases.  The  simulation  validation  results  are  derived  from  the  overwhelming  majority 
of  the  cases  in  which  the  technique  presented  here  was  applied  successfully. 

Figures  2  and  3  illustrate  an  application  of  the  Bahadur-Lazarsfeld  representation.  The  sample 
log  likelihood  vs  cumulative  probability  distribution  curve  is  drawn  using  the  likelihoods  assigned  by 
the  Bahadur-Lazarsfeld  representation  to  the  model  replications.  The  critical  region  is  determined 
from  the  likelihood  at  which  the  cumulative  probability  curve  reaches  the  tenth  percentile  (a  =  0.1 ). 
With  n  =  50  replications,  if  the  replications  are  numbered  in  order  of  increasing  likelihood  the 
boundary  of  the  critical  region  is  the  likelihood  of  the  fifth  replication.  In  Fig.  2  the  log  likelihood, 
-62.7,  of  the  reported  sequence  of  detections  of  that  target  by  that  acoustic  sensor  exceeds  the  log 
likelihood,  -67.9,  determining  the  critical  region,  so  the  simulation  predictions  are  accepted  as  a 
good  fit  to  the  recorded  observations.  In  Fig.  3  the  likelihood  of  the  reported  sequence  is  less  than 
the  likelihood  that  bounds  the  critical  region,  so  the  simulation  model  in  that  case  is  rejected  as  not 
valid. 

Users  of  the  simulation  model  commonly  run  50  replications  of  a  target  track;  but  before 
conducting  a  validation  it  was  necessary  to  determine  whether  50  model  replications  constitute  a 
large  enough  sample  with  which  to  perform  the  validation  hypothesis  tests.  Figure  4  shows  the  dis¬ 
tribution  of  likelihoods  (relative  to  their  respective  Bahadur  representations)  of  three  independent 
sets  of  50  replications  for  the  same  target  and  acoustic  sensor.  This  case  seemed  to  exhibit  more 
variability  than  usual  between  replications,  yet  the  Smirnov  two-sided  test  for  goodness-of-fit 
(at  the  0.20  significance  level)  indicated  that  the  three  samples  could  be  assumed  to  have  come  from 
the  same  population.  Thus,  it  was  concluded  that  50  replications  of  the  model  are  sufficient  for 
validation  purposes.  On  the  other  hand,  Fig.  5  is  a  similar  illustration  of  the  empirical  distributions 
of  three  independent  sets  of  20  replications  each  for  the  same  target  and  detecting  sensor.  Although 
these  samples  pass  the  Smirnov  two-sided  goodness-of-fit  test  at  0.10  significance  level,  at  the  0.20 
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significance  level  the  test  indicates  the  three  samples  did  not  come  from  the  same  population.  Thus, 
it  was  concluded  that  the  variability  between  samples  of  only  20  replications  was  too  great  to  permit 
their  use  in  a  validation. 


VALIDATION  RESULTS 

The  technique  just  described  was  applied  to  determine  the  validity  of  the  simulation  model  in 
representing  the  detection  performance  of  several  different  acoustic  sensors.  To  be  acceptable  as 
valid,  a  model  such  as  this  should  be  able  to  represent  any  of  a  wide  range  of  situations  when  given 
the  characterizing  parameters.  Thus  the  true  validation  test  lies  in  determining  how  well  the  model 
performs  over  a  number  of  cases. 

The  tracks  of  nine  different  targets,  whose  detection  histories  were  available,  were  simulated 
by  the  model,  and  50  Monte  Carlo  replications  of  each  target  track  were  produced.  The  detection 
performance  against  these  nine  targets  by  the  set  of  acoustic  sensors  that  were  operating  was  pre¬ 
dicted  by  the  model.  The  validation  test  was  applied  to  the  50  detection  performance  vectors 
produced  by  the  simulation  for  each  sensor,  along  with  the  record  of  actual  detections  by  each 
sensor.  The  results  of  these  validation  tests  are  summarized  in  Table  1.  In  this  table  a  “+”  symbol 
indicates  those  cases  in  which  no  detections  of  a  particular  target  by  a  particular  sensor  were  re¬ 
corded.  In  all  such  cases  the  simulation  model  predicted  a  low  enough  level  of  detections  to  be 
accepted  as  valid.  The  symbol  denotes  those  cases  where  the  model  forecast  was  a  good  fit 
to  the  recorded  history  of  detections  and  where  there  were  detections  of  that  target  by  that  sensor 
recorded.  The  symbol  “0”  denotes  those  cases  in  which  the  simulation  model  output  was  found  to 
be  an  inadequate  fit  to  the  observed  record  of  detections.  Table  1  shows  that  in  the  great  majority 
of  sensor-target  combinations  the  predictions  by  the  simulation  model  are  a  good  fit  to  the  recorded 
observations,  at  the  0.10  significance  level.  But  one  can  identify  certain  sensors,  particularly  C,  D, 

M,  and  R,  whose  performance  is  estimated  inadequately  by  the  model.  The  representation  of  these 
sensors  within  the  data  base  of  the  simulation  model  has  been  singled  out  for  investigation  and 
possible  improvement. 
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Table  1  —  Validation  Results 
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0 

* : Predictions  accepted  at  0.10  significance  level.  Detections  recorded. 

0:  Predict  ions  rejected  at  0.10  significance  level. 

-^Predictions  accepted  at  0.10  level,  but  no  detections  recorded  by  this  sensor. 


These  results  suggest  the  following  hypothesis  about  the  model.  It  is  to  be  recalled  that  the 
model  was  described  as  comprising  both  a  computer  routine  and  a  set  of  numbers  that  characterize 
the  situations,  the  sensors,  and  the  targets.  In  a  large  family  of  cases  the  hypothesis  of  a  good  fit 
between  model  results  and  observed  results  was  accepted;  this  would  tend  to  indicate  both  a  reason¬ 
ably  good  family  of  mathematical  routines  and  good  supporting  numbers  in  the  model.  On  the 
other  hand,  in  other  cases  the  combination  of  the  same  mathematical  routines  with  other  supporting 
values  produced  results  for  which  the  hypothesis  of  a  good  fit  was  rejected.  Consequently  it  seems 
worthwhile  to  postulate,  as  a  hypothesis  for  further  investigation,  that  the  mathematical  founda¬ 
tions  of  the  simulation  model  are  reasonable  (or  “valid”),  but  that  the  values  characterizing  some  of 
the  sensors  are  in  error. 

It  is  also  worth  noting  that  certain  targets,  such  as  number  2  and  number  8,  show  a  relatively 
high  rate  of  rejection  of  the  goodness  of  fit  hypothesis.  It  is  possible  that  the  values  used  in  the  model 
to  characterize  these  targets  were  in  error.  However,  eliminating  these  targets  from  the  data  base 
will  not  make  the  results  from  all  the  sensors  acceptable. 

In  addition  to  examining  the  individual  sensor  performance  in  the  model,  the  validation 
methodology  can  be  used  to  validate  the  results  for  families  of  sensors.  Note  that  for  validation 
purposes  the  detection  performance  of  an  entire  set  of  sensors  can  be  treated  in  exactly  the  same 
way  as  those  for  an  individual  sensor  simply  by  defining  a  “system  detection”  whenever  at  least 
one  of  the  set  of  sensors  is  detecting  the  target.  By  use  of  this  scheme  validation  results  were 
obtained  for  the  simulation  model’s  predictions  of  system  detection  performance;  these  appear  in 
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the  bottom  row  of  Table  1.  The  number  of  invalid  estimates  of  system  detection  performance 
again  suggests  that  the  simulation  model  with  the  present  data  base  should  not  be  considered 
completely  valid.  When  the  input  data  representing  certain  sensors  in  the  model  have  been  cor¬ 
rected  or  refined,  the  model  should  be  reexamined  for  validity. 


PROPERTIES  OF  THE  TEST 

The  properties  of  the  statistical  test  applied  to  validate  the  simulation  remain  open  for 
further  investigation.  Of  immediate  interest  is  whether  specifying  the  a-th  percentile  of  the  sample 
distribution  does  result  in  a  probability  of  rejection  equal  to  a.  In  addition  there  are  such  concerns 
as  to  how  one  should  deal  with  alternative  hypotheses  and  what  kind  of  power  does  the  test  ex¬ 
hibit  in  evaluating  alternatives.  Only  partial  answers  have  been  obtained  to  some  of  these  questions. 

Under  the  assumption  that  the  simulation  model  is  valid,  the  statistical  test  for  a  specific 
sensor-track  combination  constitutes  a  Bernoulli  trial  with  rejection  probability  a  and  probability 
of  no  rejection  1  -  a.  Here  the  sample  distribution  (of  the  n  replications)  is  used  as  an  approxima¬ 
tion  to  the  theoretical  distribution;  hence  the  actual  rejection  probability  may  appear  to  be  data- 
driven.  However  the  repetitions  are  identically  distributed.  Repetitions  of  the  test  itself  under 
identical  conditions  should  generate  a  proportion  of  rejections  approximating  the  probability  of 
rejection  as  the  number  of  repetitions  increase.  In  general  there  are  n  +  1  vectors  each  of  length  m. 
Ideally  the  probability  of  rejection  should  equal  or  when  the  n  +  1st  vector  (sample  vector)  is  from 
the  same  distribution  as  the  first  n  and  it  should  equal  the  power  of  the  test  when  the  n  +  1st 
sample  is  not  from  the  same  distribution.  To  make  inferences  about  the  probability  of  rejection, 
successive  repetitions  of  the  same  test  may  be  generated.  The  repetitions  should  be  grouped  into 
subsets  of  equal  size  from  which  a  proportion  of  rejections  may  be  computed  for  each  one.  The 
computed  proportions  are  a  sequence  of  iid.  random  variables.  Taking  a  large  enough  number  of 
subsets  one  can  then  appeal  to  the  Central  Limit  Theorem  and  use  classical  statistical  techniques 
to  make  inferences  on  the  probability  of  rejection  that  results  when  applying  the  given  test  at  a 
specific  level  a. 

A  preliminary  evaluation  of  the  properties  of  the  test  has  been  conducted  with  the  use  of  the 
computer.  Sets  of  n  +  1  random  binary  vectors  were  repeatedly  generated  from  a  known  distribu¬ 
tion  of  correlation  order  1  and  the  test  procedure  was  applied  with  a  specified  level  a  to  determine 
whether  the  n  +  1st  vector  did  or  did  not  belong  to  the  population  from  which  the  first  n  vectors 
came.  The  evaluation  was  conducted  for  various  values  of  m  and  various  values  of  n  for  each  m. 

For  the  cases  considered,  the  results  appear  to  indicate  that  the  probability  of  rejection  is  actually 
less  than  a  for  m  =  2  or  3,  but  it  approaches  a  from  below  as  m  increases  and  is  practically  a  for 
m-  5  and  6.  This  conclusion  however  cannot  be  accepted  as  general  because  the  analysis  was 
limited  to  a  few  known  distributions  of  very  simple  structure.  In  a  similar  fashion  the  n  +  1st 
vector  was  then  generated  from  a  known  distribution  of  order  1  other  than  the  distribution  from 
which  the  first  n  vectors  were  generated.  The  same  procedure  was  applied,  where  the  proportion 
of  rejections  now  estimates  the  power  of  the  test  against  a  known  alternative.  Again,  the  analysis 
was  limited  but  the  test  technique  appears  to  discriminate  very  well. 

An  interesting  problem  revealed  in  the  evaluation  is  that  vectors  with  negative  likelihoods 
arise  as  a  function  of  the  fluctuation  of  f(x)  around  its  theoretical  value.  In  the  cases  considered 
the  theoretical  value  is  f(x)  =  1.  As  n  increases  the  distribution  of  the  n  values  f(x)  tends  towards 
a  spike  at  1  and  the  number  of  vectors  resulting  in  negative  likelihoods  decreases  or  disappears. 

In  the  future  more  formal  attention  should  be  paid  to  this  problem  since  it  is  relevant  in  assessing 
the  adequacy  of  the  Bahadur-Lazarsfeld  representation  for  practical  application  purposes. 
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Of  a  more  general  nature  is  one  particular  data  structure  that  shows  that  the  probability  of 
rejection  approaches  a  asm,  the  number  of  columns  increase.  The  analysis  consists  of  letting 
n  =  2m  where  all  binary  vector  columns  consist  of  independent,  identically  distributed  Bernoulli 
random  variables  with  the  probability  of  a  1  given  by  v.  There  are  (2m  )2m  possible  outcomes 
from  which  the  Bahadur-Lazarsfeld  representations  may  be  obtained.  To  each  of  these  outcomes 
one  may  associate  2m  possible  n  +  1st  or  observed  vectors  yielding  a  total  of  (2m  )2m  + 1  elementary 
outcomes  to  be  considered.  First  it  is  assumed  that  the  observed  vectors  come  from  the  same 
distribution  as  the  sets  of  first  n  vectors.  Specifying  a,  evaluating  N  (as  defined  in  application  to 
validation)  for  each  outcome  and  applying  the  rules  of  the  test  one  may  establish  an  association 
between  a  and  the  actual  probability  of  rejection.  Table  2  lists  all  possible  outcomes  when  per¬ 
forming  the  test  for  the  case  m  =  1.  In  this  case  the  probability  of  rejection  is  independent  of  a. 

For  any  ae  (0,1]  the  probability  of  rejection  is  given  by  pr(v)  =  u(l  -  u)  depending  only  on  v. 

The  function  pr{v)  has  domain  [0,1] ,  is  concave,  and  is  symmetric  around  the  point  v  =  1/2  where 
it  achieves  its  maximum  of  1/4.  If  it  is  now  assumed  that  the  observed  vector  comes  from  a  different 
distribution,  say  a  Bernoulli  random  variable  with  the  probability  of  a  1  given  by  t,  then  the  power 
of  the  test  is  also  independent  of  a  and  given  by  1  -  j3  =  e2(l  -  f)  +  (1  -  v)2t.  The  power  of  the 
test  depends  only  on  the  relative  values  of  v  and  t.  For  larger  values  of  m  the  number  of  elementary 
outcomes  to  consider  increases  very  rapidly,  yet  one  can  try  to  discern  a  pattern  from  the  cases 
m  =  1,  2  and  3.  This  is  best  illustrated  by  considering  what  happens  when  m  =  2.  The  values  of  N 
are  either  0, 1,  2  or  4.  This  means  that  for  values  of  a  in  the  intervals  /j  =  (0, 1/4] ,  /2  =  (1/4,  1/2] 
and  I3  =  (1/2, 1]  there  correspond  three  different  probability  of  rejection  functions  prl(c),pr2(u), 
and pr3( v),  depending  only  on  v.  These  functions  have  range  [0,  hx  ] ,  [0,  /i2 ]  and  [0,  h3  ]  respec¬ 
tively  where  h1  e  7j ,  h2  e  /2,  and  h3  e  I3.  The  functions Pr2(u)  and  Pr3(v)  corresponding  to  the 
larger  values  of  a  are  concave  and  symmetric  around  v  =  1/2  where  they  achieve  their  maxima  h2 
and  h3 .  As  a  gets  smaller,  in  this  case  ae/j ,  the  function  prl  (u)  begins  to  behave  in  a  different 
manner.  It  remains  symmetric  around  v  =  1/2  but  in  this  instance  it  becomes  bimodal.  It  jumps 
to  a  quicker  maximum  achieved  at  about  i;  =  0.3  (also  V  -  0.7)  and  remains  fairly  close  to  its 
maximum  for  values  of  v  between  v  =  0.3  and  v  =  0.7.  For  values  of  m  >  1  the  number  of  non¬ 
overlapping  subintervals  covering  (0, 1]  where  a  may  be  specified  is  given  by  2m  -  1.  The  right¬ 
most  subinterval  always  has  length  l/2m~ 1 .  All  others  have  length  l/2m .  The  length  of  the  sub¬ 
intervals  decreases  rapidly  and  to  each  one  corresponds  a  unique  probability  of  rejection  function 
pr(v)  with  range  [0,  h]  where  h  is  contained  in  the  subinterval  where  a  assumes  its  value.  The 
functions  pr(v)  depend  only  on  v  and  are  symmetric  around  v  =  1/2.  One  can  only  speculate  as  to 
the  general  pattern  followed  by  the  functions  Pr(v)  as  m  increases.  As  m  increases  h  -*  a  and  it 
appears  that  for  smaller  a's  the  functioffc  pr(i>)  tend  to  achieve  their  maximum  rapidly  and  remain 
close  to  this  maximum  for  values  of  v  between  the  modes  perhaps  approximating  a  rectangular 
shape.  This  behavior  supports  that  observed  from  the  computer  evaluation.  If  this  is  the  case,  as 
m  gets  large  and  a  gets  small  the  probability  of  rejection  approaches  a.  The  functions  pr(v)  for  the 
case  m  =  2  are  plotted  in  Fig.  6. 
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Fig.  6  —  Functions  pr(v)  for  the  case  m  -  2 

The  analysis  is  by  no  means  complete.  So  far  only  special  cases  with  relatively  simple  struc¬ 
tures  have  yielded  some  information  on  the  nature  of  the  statistical  test.  It  is,  however,  a  chal¬ 
lenging  problem  that  remains  open  for  further  consideration. 

SUMMARY 

In  summary,  a  method  has  been  developed  for  the  statistical  testing  of  a  Monte  Carlo  simu¬ 
lation  whose  output  can  be  reduced  to  time  series  of  binary  data.  This  method  has  been  imple¬ 
mented  on  a  high-speed  digital  computer  and  has  been  successfully  applied  to  determine  that  a 
particular  simulation  model,  in  combination  with  its  approved  data  base,  should  not  be  considered 
valid.  The  validation  technique  presented  here  is  applicable  to  simulations  in  a  wide  range  of  sub¬ 
jects,  especially  sonar  systems,  radar  systems,  other  detection  processes,  and  other  processes 
which  involve  threshold  crossing  criteria  to  establish  binary  (“yes”  or  “no”)  outputs. 
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